
README_dataset_description.txt

Dataset title
Syntactic profiles of large language models in Spanish academic discourse

Authors
[Author name]
[Institutional affiliation]

Repository
Harvard Dataverse

1. Overview
This dataset contains syntactic metrics extracted from a corpus of Spanish academic texts generated by three large language models (LLMs): ChatGPT, Claude, and Gemini. The dataset was created to examine whether different LLMs produce distinguishable syntactic profiles when generating academic discourse in Spanish.

The dataset includes structural and syntactic features derived from dependency parsing using the Universal Dependencies (UD) framework. These features were extracted automatically through a reproducible computational pipeline implemented in Python.

All data included in this repository are fully de‑identified and contain no personal or sensitive information.

2. Corpus description
The corpus consists of academic-style responses generated by three large language models:

- ChatGPT
- Claude
- Gemini

Each response corresponds to an independent text generated under controlled prompting conditions designed to elicit academic discourse in Spanish.

Each row in the dataset corresponds to one generated text and each column represents a structural or syntactic metric extracted from the text.

3. Linguistic processing pipeline
The linguistic processing pipeline was implemented in Python using the Stanza natural language processing library, which provides neural models for tokenization, part‑of‑speech tagging, and dependency parsing.

The analysis relies on the Universal Dependencies (UD) annotation framework.

Processing steps included:
1. Text tokenization
2. Sentence segmentation
3. Morphosyntactic tagging
4. Dependency parsing
5. Extraction of syntactic relations
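
The processing steps above can be sketched as follows. This is a minimal illustration, not the code in analisis_ud.py: the function names build_pipeline and dependency_triples are hypothetical, and the processor list is an assumption about how the Stanza pipeline was configured. Stanza is imported lazily so that the helper can be read and tested without the Spanish models installed.

```python
def build_pipeline():
    """Create a Spanish Stanza pipeline.

    Assumes `pip install stanza` and a one-time `stanza.download("es")`.
    """
    import stanza  # lazy import: the helper below does not need it

    # Assumed processor list; depparse depends on the preceding processors.
    return stanza.Pipeline(lang="es", processors="tokenize,mwt,pos,lemma,depparse")


def dependency_triples(doc):
    """Return (dependent, relation, head) triples for every word in a parsed document."""
    triples = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # word.head is a 1-based index into sentence.words; 0 marks the root
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            triples.append((word.text, word.deprel, head))
    return triples
```

With the models installed, a call such as dependency_triples(build_pipeline()("Creo que funciona.")) would list one triple per syntactic word.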

The following dependency relations were extracted:

ccomp – clausal complements (typically finite, with their own subject)
xcomp – open clausal complements without explicit subjects
advcl – adverbial subordinate clauses
acl:relcl – relative clauses modifying noun phrases
cc – coordinating conjunction markers
conj – conjuncts (coordinated elements)
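
As an illustration of how the six relations listed above might be counted, the sketch below tallies them from a flat list of dependency labels. The names TARGET_RELATIONS and count_relations are hypothetical, and exact label matching is an assumption: a bare "acl" label, for example, is not counted as acl:relcl here.

```python
from collections import Counter

# The six relations listed in this README
TARGET_RELATIONS = ("ccomp", "xcomp", "advcl", "acl:relcl", "cc", "conj")


def count_relations(deprels):
    """Count the six target relations in an iterable of dependency labels."""
    counts = Counter(label for label in deprels if label in TARGET_RELATIONS)
    # Report every target relation, including those that never occur
    return {relation: counts[relation] for relation in TARGET_RELATIONS}
```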

All syntactic relations were normalized as occurrences per 1000 tokens to ensure comparability across texts of different lengths.
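
The normalization amounts to rate = 1000 * count / tokens. A minimal sketch follows; the function name per_1000 is hypothetical, and raising an error for empty texts is an assumption about how such cases should be handled.

```python
def per_1000(count, n_tokens):
    """Convert a raw relation count into occurrences per 1000 tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 1000 * count / n_tokens
```

For example, 12 occurrences of a relation in a 480-token text give a rate of 25 per 1000 tokens.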

4. Variables included in the dataset

Structural metrics:
tokens – total number of tokens
sentences – number of sentences
tokens_per_sentence – mean sentence length in tokens (tokens / sentences)
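
A minimal sketch of how the three structural metrics relate to one another, assuming the per-sentence token counts are already available. The function name structural_metrics is hypothetical, and returning 0.0 for a text with no sentences is an assumption.

```python
def structural_metrics(sentence_lengths):
    """Compute the structural metrics from a list of per-sentence token counts."""
    tokens = sum(sentence_lengths)
    sentences = len(sentence_lengths)
    # Mean sentence length; 0.0 for an empty text (assumed convention)
    tokens_per_sentence = tokens / sentences if sentences else 0.0
    return {
        "tokens": tokens,
        "sentences": sentences,
        "tokens_per_sentence": tokens_per_sentence,
    }
```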

Syntactic metrics (normalized per 1000 tokens):
ccomp_per1000
xcomp_per1000
advcl_per1000
acl_relcl_per1000
cc_per1000
conj_per1000

These variables operationalize key dimensions of syntactic organization related to coordination and subordination patterns in academic discourse.
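
Before analysis, it may help to confirm that a data file contains the nine variables listed above. The sketch below assumes the dataset is distributed as a CSV file with a header row, which this README does not state; the names EXPECTED_COLUMNS and missing_columns are hypothetical, and the file may contain further columns (for example, a model identifier) that are not checked here.

```python
import csv

# Variable names as listed in this README
EXPECTED_COLUMNS = [
    "tokens", "sentences", "tokens_per_sentence",
    "ccomp_per1000", "xcomp_per1000", "advcl_per1000",
    "acl_relcl_per1000", "cc_per1000", "conj_per1000",
]


def missing_columns(csv_file):
    """Return the expected columns absent from an open CSV file's header."""
    header = csv.DictReader(csv_file).fieldnames or []
    return [column for column in EXPECTED_COLUMNS if column not in header]
```

The function takes an open file object, so it can be used as missing_columns(open(path, newline="", encoding="utf-8")) with whatever path the data file has.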

5. Data processing scripts

Two scripts are provided for reproducibility:

analisis_ud.py
Performs linguistic preprocessing and extracts syntactic metrics from the corpus using the Stanza pipeline.

estadistica_ud.py
Performs the statistical analyses reported in the study, including:
- descriptive statistics
- Welch ANOVA
- Kruskal–Wallis tests
- effect size estimation
- MANOVA
- PERMANOVA
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- bootstrap estimation of Mahalanobis distances between model centroids
- generation of figures used in the article
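
To make one of the listed tests concrete, the sketch below computes the Welch ANOVA statistic for a single metric using only the Python standard library. It is an illustration under stated assumptions, not the code in estadistica_ud.py, which may rely on a library routine instead; the function name welch_anova is hypothetical. Each group needs at least two observations and a non-zero variance.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA on a list of samples."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n_i / s_i^2 (sample variance)
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_weight = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2
```

Here each element of groups would hold the values of one metric (for example, advcl_per1000) for one model. A p-value can then be obtained from the F distribution, for instance with scipy.stats.f.sf(F, df1, df2).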

6. Reproducibility
The analysis pipeline was implemented in Python.

Main libraries used:
pandas
numpy
scipy
statsmodels
scikit-learn
matplotlib
stanza
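
Because results can vary across library versions, it may be useful to record the installed versions alongside any replication. A minimal sketch follows; the names LIBRARIES and installed_versions are hypothetical, and the package names are assumed to match their distribution names on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version

# Libraries listed in this README (assumed PyPI distribution names)
LIBRARIES = [
    "pandas", "numpy", "scipy", "statsmodels",
    "scikit-learn", "matplotlib", "stanza",
]


def installed_versions(packages=LIBRARIES):
    """Map each package name to its installed version, or None if it is absent."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report
```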

The scripts included in this repository allow full replication of the statistical analyses reported in the study.

7. Data availability
The dataset supporting the findings of this study has been structured according to the FAIR data principles (Findable, Accessible, Interoperable, Reusable) and deposited in the Harvard Dataverse repository.

The repository includes:
- the dataset of syntactic metrics
- the scripts used to generate the dataset and statistical analyses
- documentation describing the variables and processing pipeline

8. License
The dataset is made available for research and academic purposes under an open scientific data license.

9. Citation
If you use this dataset, please cite the associated article once published.
